{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cellular-malpractice",
   "metadata": {},
   "source": [
    "# Loading Text Dataset\n",
    "\n",
    "`Linux` `Ascend` `GPU` `CPU` `Data Preparation` `Beginner` `Intermediate` `Expert`\n",
    "\n",
    "[![](https://gitee.com/mindspore/docs/raw/master/tutorials/training/source_en/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/master/tutorials/training/source_en/use/load_dataset_text.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "united-implementation",
   "metadata": {},
   "source": [
    "## Overview\n",
    "\n",
    "The `mindspore.dataset` module provided by MindSpore enables users to customize their data fetching strategy from disk. At the same time, data processing and tokenization operators are applied to the data. Pipelined data processing produces a continuous flow of data to the training network, improving overall performance.\n",
    "\n",
    "In addition, MindSpore supports data loading in distributed scenarios. Users can define the number of shards while loading. For more details, see [Loading the Dataset in Data Parallel Mode](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/distributed_training_ascend.html#loading-the-dataset-in-data-parallel-mode).\n",
    "\n",
    "This tutorial briefly demonstrates how to load and process text data using MindSpore.\n",
    "\n",
    "## Preparations\n",
    "\n",
    "1. Prepare the following text data."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "green-characteristic",
   "metadata": {},
   "source": [
    "    Welcome to Beijing!\n",
    "\n",
    "    北京欢迎您！\n",
    "\n",
    "    我喜欢English!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "verified-oakland",
   "metadata": {},
   "source": [
    "2. Create the `tokenizer.txt` file, copy the text data to the file, and save the file under `./datasets` directory. The directory structure is as follow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "forbidden-positive",
   "metadata": {},
   "outputs": [],
   "source": [
    "    import os\n",
    "\n",
    "    if not os.path.exists('./datasets'):\n",
    "        os.mkdir('./datasets')\n",
    "    file_handle=open('./datasets/tokenizer.txt',mode='w')\n",
    "    file_handle.write('Welcome to Beijing \\n北京欢迎您！ \\n我喜欢English! \\n')\n",
    "    file_handle.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "geological-indicator",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "./datasets\n",
      "├── MNIST_Data\n",
      "│   ├── test\n",
      "│   │   ├── t10k-images-idx3-ubyte\n",
      "│   │   └── t10k-labels-idx1-ubyte\n",
      "│   └── train\n",
      "│       ├── train-images-idx3-ubyte\n",
      "│       └── train-labels-idx1-ubyte\n",
      "└── tokenizer.txt\n",
      "\n",
      "3 directories, 5 files\n"
     ]
    }
   ],
   "source": [
    "    ! tree ./datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "continuous-cornell",
   "metadata": {},
   "source": [
    "3. Import the `mindspore.dataset` and `mindspore.dataset.text` modules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "handled-cooking",
   "metadata": {},
   "outputs": [
   ],
   "source": [
    "    import mindspore.dataset as ds\n",
    "    import mindspore.dataset.text as text"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "incorporated-glasgow",
   "metadata": {},
   "source": [
    "## Loading Dataset\n",
    "\n",
    "MindSpore supports loading common datasets in the field of text processing that come in a variety of on-disk formats. Users can also implement custom dataset class to load customized data. For detailed loading methods of various datasets, please refer to the [Loading Dataset](https://www.mindspore.cn/doc/programming_guide/en/master/dataset_loading.html) chapter in the Programming Guide.\n",
    "\n",
    "The following tutorial demonstrates loading datasets using the `TextFileDataset` in the `mindspore.dataset` module.\n",
    "\n",
    "1. Configure the dataset directory as follows and create a dataset object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "therapeutic-press",
   "metadata": {},
   "outputs": [],
   "source": [
    "    DATA_FILE = \"./datasets/tokenizer.txt\"\n",
    "    dataset = ds.TextFileDataset(DATA_FILE, shuffle=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "split-mustang",
   "metadata": {},
   "source": [
    "2. Create an iterator then obtain data through the iterator."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "collect-consortium",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Welcome to Beijing \n",
      "北京欢迎您！ \n",
      "我喜欢English! \n"
     ]
    }
   ],
   "source": [
    "    for data in dataset.create_dict_iterator(output_numpy=True):\n",
    "        print(text.to_str(data['text']))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "concrete-victor",
   "metadata": {},
   "source": [
    "## Processing Data\n",
    "\n",
    "For the data processing operators currently supported by MindSpore and their detailed usage methods, please refer to the [Processing Data](https://www.mindspore.cn/doc/programming_guide/en/master/pipeline.html) chapter in the Programming Guide\n",
    "\n",
    "The following tutorial demonstrates how to construct a pipeline and perform operations such as `shuffle` and `RegexReplace` on the text dataset.\n",
    "\n",
    "1. Shuffle the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "shaped-conditions",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "我喜欢English! \n",
      "Welcome to Beijing \n",
      "北京欢迎您！ \n"
     ]
    }
   ],
   "source": [
    "    ds.config.set_seed(58)\n",
    "    dataset = dataset.shuffle(buffer_size=3)\n",
    "\n",
    "    for data in dataset.create_dict_iterator(output_numpy=True):\n",
    "        print(text.to_str(data['text']))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "figured-interval",
   "metadata": {},
   "source": [
    "2. Perform `RegexReplace` on the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "boring-result",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "我喜欢English! \n",
      "Welcome to Shanghai \n",
      "上海欢迎您！ \n"
     ]
    }
   ],
   "source": [
    "    replace_op1 = text.RegexReplace(\"Beijing\", \"Shanghai\")\n",
    "    replace_op2 = text.RegexReplace(\"北京\", \"上海\")\n",
    "    dataset = dataset.map(operations=replace_op1)\n",
    "    dataset = dataset.map(operations=replace_op2)\n",
    "\n",
    "    for data in dataset.create_dict_iterator(output_numpy=True):\n",
    "        print(text.to_str(data['text']))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "challenging-trunk",
   "metadata": {},
   "source": [
    "## Tokenization\n",
    "\n",
    "For the data tokenization operators currently supported by MindSpore and their detailed usage methods, please refer to the [Tokenizer](https://www.mindspore.cn/doc/programming_guide/en/master/tokenizer.html) chapter in the Programming Guide.\n",
    "\n",
    "The following tutorial demonstrates how to use the `WhitespaceTokenizer` to tokenize words with space.\n",
    "\n",
    "1. Create a `tokenizer`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "respiratory-lighter",
   "metadata": {},
   "outputs": [],
   "source": [
    "    tokenizer = text.WhitespaceTokenizer()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "accepting-arrest",
   "metadata": {},
   "source": [
    "2. Apply the `tokenizer`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "human-enforcement",
   "metadata": {},
   "outputs": [],
   "source": [
    "    dataset = dataset.map(operations=tokenizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "rapid-paragraph",
   "metadata": {},
   "source": [
    "3. Create an iterator and obtain data through the iterator."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "external-athletics",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['我喜欢English!']\n",
      "['Welcome', 'to', 'Shanghai']\n",
      "['上海欢迎您！']\n"
     ]
    }
   ],
   "source": [
    "    for data in dataset.create_dict_iterator(output_numpy=True):\n",
    "        print(text.to_str(data['text']).tolist())"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "MindSpore-1.1.1",
   "language": "python",
   "name": "mindspore-1.1.1"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
